Prosper is a website that connects lenders with borrowers. It takes data from the lenders like credit score, employment history, and the reason for getting the loan and provides it to lenders. Lenders can choose who they wish to loan money to. Prosper provides a dataset of over 113,000 loans that were made with over 80 data points on each loan. This investigation sorts through the dataset to find interesting trends and create (hopefully) insightful graphics depicting these trends.
library(ggplot2)
library(GGally)
library(RColorBrewer)
library(corrplot)
setwd("/home/nathan/Desktop/Nanodegree/P4 - Loan Data Visualization with R")
loan = read.csv('prosperLoanData.csv')
Listing data was provided as a numerical value with a separate legend. This section transforms it to a factor variable and adds back in the descriptions for each number. Also in this section the Occupation variable is reordered so that the boxplot later on is easier to read.
Current Delinquencies is cut into 4 buckets to make a variable that can be more easily used to show color on a plot. The table function is used before cut to determine how to equally size the buckets.
a <-loan$ListingCategory..numeric.
loan$ListingCategory[a == 0] <- "0 - Not Available"
loan$ListingCategory[a == 1] <- "1 - Debt Consolidation"
loan$ListingCategory[a == 2] <- "2 - Home Improvement"
loan$ListingCategory[a == 3] <- "3 - Business"
loan$ListingCategory[a == 4] <- "4 - Personal Loan"
loan$ListingCategory[a == 5] <- "5 - Student Use"
loan$ListingCategory[a == 6] <- "6 - Auto"
loan$ListingCategory[a == 7] <- "7- Other"
loan$ListingCategory[a == 8] <- "8 - Baby&Adoption"
loan$ListingCategory[a == 9] <- "9 - Boat"
loan$ListingCategory[a == 10] <- "10 - Cosmetic Procedure"
loan$ListingCategory[a == 11] <- "11 - Engagement Ring"
loan$ListingCategory[a == 12] <- "12 - Green Loans"
loan$ListingCategory[a == 13] <- "13 - Household Expenses"
loan$ListingCategory[a == 14] <- "14 - Large Purchases"
loan$ListingCategory[a == 15] <- "15 - Medical/Dental"
loan$ListingCategory[a == 16] <- "16 - Motorcycle"
loan$ListingCategory[a == 17] <- "17 - RV"
loan$ListingCategory[a == 18] <- "18 - Taxes"
loan$ListingCategory[a == 19] <- "19 - Vacation"
loan$ListingCategory[a == 20] <- "20 - Wedding Loans"
loan$ListingCategory <- as.factor(loan$ListingCategory)
loan$ListingCategory <- with(loan,
reorder(ListingCategory, BorrowerRate, median))
loan$Occupation <- with(loan, reorder(Occupation, BorrowerRate, median))
loan$CurDelBucket <- cut(loan$CurrentDelinquencies, c(-Inf,0,1,5,+Inf))
After the libraries and data are loaded, the next step is to take a look at the list of variables in the data and plot single variables (univariate plots) on a histogram or box plot to get an idea of the range of different variables. This will give an idea of what could be explored later.
summary(loan$CreditScoreRangeLower)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
Most credit scores for people getting loans on Prosper are between 600 and 800 with the mean around 685. It would be interesting to compare this trend to the credit scores of the general population.
summary(loan$BorrowerRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The rate that most borrowers are charged looks like a normal distribution with mean around 19%. There are some outliers on the higher end of the spectrum that may need to be dealt with. This variable will be a key part of the bivariate and multivariate analysis because for a potential borrower, it is useful to know the factors that affect the interest rate that they will ultimately be paying.
summary(loan$EstimatedLoss)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.005 0.042 0.072 0.080 0.112 0.366 29084
The mean estimated loss is around 8% and the third quartile is 11.2%. This indicates that most loans lose less than 15%.
summary(loan$EstimatedReturn)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.074 0.092 0.096 0.117 0.284 29084
Nearly all loans have a positive Estimated Return. This is to be expected because a lender would likely not make a load if the estimated return is negative.
summary(loan$OpenRevolvingAccounts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 4.00 6.00 6.97 9.00 51.00
This trend shows a positive skew. However the tail of the distribution is not long enough to use a log scale on the x axis to attempt to make the distribution more normal. The plot below changes the x axis to a logaritmic scale and you can see that the numbers are too clustered around 1 for it to transform the graph properly into a normal distribution.
summary(loan$OpenRevolvingMonthlyPayment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 114.0 271.0 398.3 525.0 14980.0
The trend for open revolving monthly payment is similar the the revolving accounts trend where it is positively skewed and has a long tail. The values in this chart are sufficiently large that it responds better to a log 10 transformation of the x axis. The resulting distribution is below and is pretty close to a normal distribution.
summary(loan$AmountDelinquent)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 984.5 0.0 463900.0 7622
Amount delinquent is also a very clear long tail trend. Below is the same plot with the x axis transformed to log10 scale. To create these plots, ggplot was used instead of the prebuilt function because the 0 values for AmountDelinquent had to be excluded from the data set.
In this case transforming the x axis to a log scale gives what appears to be a normal distribution. This means tools like z-scores to predict delinquent amount where a certain percentage of the sample would be below or above that value will work.
summary(loan$BankcardUtilization)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.310 0.600 0.561 0.840 5.950 7604
This trend has a fairly linear distribution and does not appear that it would respond well to a trasformation or the assumptions of a normal distribution.
summary(loan$DebtToIncomeRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The debt to income ratio trend shows positive skew and this is confirmed by the fact that the mean (~28%) is much higher than the median (22%).
In this section the relationship between several pairs of variables will be plotted. Again this is to get an idea of what kinds of relationships exist in the data to see what trends are worth exploring further. The first plots will be a correlation matrices that plot several variables against each other in a matrix to see if there are any interesting relationships. The correlation matrices are targetted at the Borrower Rate and Credit Score variables. In the first 2 matrices, the borrower rate plotted against 9 other variables to see which relationships might be worth investigating further. The second 2 matrices plot credit score against several variables for the same purpose.
As shown in the correlation matrix, there is a very high correlation between Borrower Rate and Estimated Loss. This is expected because Borrowers that are seen as riskier will be given higher interest rates. The expected loss will also be higher for borrowers that are seen as risky. This is the expected relationship that economic theory predicts - higher risk (estimated loss) demands a higher return (interest rate).
Interestingly, Employment Duration (EmplDur), Revolving Monthly Payment (RevMnthPmt), Amount Delinquent (AmntDelinq) and Debt to Income Ratio (DtoIRatio) have no significant correlation with Borrower Rate. I would have predicted that Debt to Income Ratio would have some correlation with borrower rate as would seem to be a clear indicator of risk.
Borrower Rate has a moderate negative correlation with Credit Score (CScoreLow) (negative correlation). Credit Score is expected to be correlated because the credit score metric attempts to quantify how likely a borrower is to pay back a loan.
There are 2 variables appear to be correlated with Borrower Rate. 1. Bank Card Utilization has a weak positive correlation with Borrower Rate. This seems to make sense in that borrowers who have maxed out their current credit lines would seem to be a higher risk. However, if a borrower is only applying for their second or third credit line, they would seem to be less risky than someone applying for their seventh or eighth line of credit as they likely have less credit. This multivariate relationship will be studied more in section 1.4 Multivariate Exploration. 2. Available Bank Credit has a somewhat stronger negative correlation with Borrower Rate. This is interesting in that higher available credit correlates with a lower Borrower Rate and lower credit utilization correlates with a lower Borrower Rate. A deeper exploration of the multivariate relationship between amount of credit, credit utilization and borrower rate will follow in section 1.4.
Most variables in this plot show very weak or no correlation with credit score. These variables do not show much promise for continued exploration.
A number of the variables in this plot show weak to moderate correlations with credit score. The list below shows the variable name and the magnitude and direction of correlation with credit score. - Current Delinquencies - moderate, negative - Delinquencies Last 7 Years - moderate, negative - Bankcard Utilization - moderate, negative - Available Bankcard Credit - moderate, positive - Stated Monthly Income - very weak, positive
Some of these variables will be explored more deeply in section 1.4. This is a good list to start building a model for predicting credit scores.
cor.test(loan$CreditScoreRangeLower, loan$BorrowerRate)
##
## Pearson's product-moment correlation
##
## data: loan$CreditScoreRangeLower and loan$BorrowerRate
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4661358 -0.4569730
## sample estimates:
## cor
## -0.4615667
The plot seems to show a negative correlation as expected by the -0.46 correlation coefficient, but it is not that easy to see with the scatter plot. Adding the mean, median, 10% and 90% quantile summary lines would give a clearer indication of what is happening with these 2 variables.
The summary lines make the trend much clearer. The interest rate that the borrower pays (borrower rate) decreases as the credit score increases, however there is still a lot of variation in the interest rate for a given credit score. Credit score is clearly one component that determines the borrower rate but it does not tell the whole story. Other variables that correlate with borrower rate will be worth investigating, especially for cases where the credit score is below 600.
This plot also clearly shows that there are fewer loans established when the borrower has a credit score lower than 600 or higher than 800. This could be for a number of reasons but 3 reasons that could be worth investigating would be: - borrowers with lower credit scores are more likely to be denied loans - people with high credit scores are less likely to borrow money - the majority of people have credit scores between 600 and 800 and this is why there are more loans established
Unfortunately more data would be required to investigate this further. The distribution of credit scores from a random sample of the general population and information on applicants who were denied a loan would be required to investigate further.
cor.test(loan$BankcardUtilization, loan$BorrowerRate)
##
## Pearson's product-moment correlation
##
## data: loan$BankcardUtilization and loan$BorrowerRate
## t = 86.168, df = 106330, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2498551 0.2610917
## sample estimates:
## cor
## 0.255482
There is a weak positive correlation (~0.25) between Bankcard Utilization and Borrower Rate. The plot seems to a positive correlation as expected but it is far too difficult to see. Again adding summary lines will make the trends clearer.
The summary lines show a much clearer positive correlation. They also show why the correlation is weak even though the positive trend is continuous. The variation is quite high as show by the 10% and 90% quantile lines. Even though the trend is upwards, the high degree of variation seems to cause the correlation to be low.
cor.test(loan$AvailableBankcardCredit, loan$BorrowerRate)
##
## Pearson's product-moment correlation
##
## data: loan$AvailableBankcardCredit and loan$BorrowerRate
## t = -119.44, df = 106390, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3491486 -0.3385518
## sample estimates:
## cor
## -0.3438611
The negative correlation between Available Bankcard Credit and Borrower Rate is clearly shown by the summary line. In this case it appears to be an exponentially decreasing trend. At low values of Available Bankcard Credit, the Borrower Rate falls quickly. Between $25,000-50,000 of credit, the trend seems to level off and become far more volatile. This would seem to indicate that having at least $25,000 to $50,000 of Available Credit would be helpful in getting a lower Borrower Rate but beyond that, other factors take over.
Home ownership appears to impact interest rates. Those who own their home have lower median and first and third quartile Borrower Rates. There are also no high end outliers for homeowners and no rates above ~36% while non-homeowners show several extreme rates between 35-50%.
As is expected, the box plot shows that those who are unemployed have the highest Borrower Rates. Self employed people have higher rates than those employed full or part time. Those who are retired seem to have slightly higer rates than full or part time employed individuals, but lower that those who are self employed. Adding the scatter plot with alpha, color and jitter shows how many data points there are in each employment category. There are far fewer loans given to unemployed, retired or part time individuals. Most loans are made to those who indicate that they are employed, employed full-time or are self employed.
There seems to be a faily large difference in the median interest rate depending on what the reason for the loan is. Personal loans, boat purchases, baby & adoption and debt consolidation appear to have the lowest borrower rates. Household expenses and cosmetic procedures have the highest rates. The overlay of the scatterplot also shows which reasons are more common for loans. Debt consolidation, business and home improvement are the most common reasons listed. Boats, RVs, Green Loans and cosmetic procedures are the least frequently listed reasons. This variable would be useful to overlay as a third factor in the multivariate section.
The chart shows a fairly clear trend that professional, high education jobs (judge, engineer, pharmacist, professor) tend to have lower borrower rates while lower skill jobs with lower educational requirements (teachers aide, bus driver, clerical, student, flight attendant) tend to have higher borrower rates.
In this section, more that 2 variables will be investigated to see how they interact. The existing correlations that have been uncovered will be expanded on to try to create plots that provide more insight.
The plot seems to show that those with lower Bankcard Utilization have higher credit scores and pay lower Borrower Rates. Those with high Bankcard Utilization tend to be on the left hand side of the plot with the lower credit scores and the higher borrower rates.
The plot shows that those who own their homes appear to have a higher credit score and appear to have lower borrower rates, especially as credit score increases.
This plot seems to show that for low credit scores, the borrowers employment status does not have a clear effect on the borrower rate. However, it appears that once the credit score reaches around 650, the borrower rate for full-time employed borrowers tends to decrease while the same is not true for those who are not employed. The next plot zooms in on the credit score 650-800 section of the plot.
The trendlines show that the optimal combination to get a low interest rate is to have a significant amount of available credit (more than $50,000) and to have a credit utilization of 25% or less.
This plot took several iterations. The first plot shows that the mean trendline requires some smoothing so Available Bank Credit is rounded to the nearest 100. Few borrowers have more than $10,000 credit so the plot is cut off at $10,000. The final plot switches to version of Current Delinquencies that has been cut into 4 buckets. In the second visual, the dark blue dots dominate the plot and this is because a vast majority of borrowers have 0, 1 or 2 delinquencies (see table(CurrentDelinquencies) in section 1). Cutting the Current Delinquencies variable into 4 buckets makes the plot clearer. Further analysis in section 5.
Listing categories “Not Available”, Debt Consolidation, Business and Home Improvement have a high number of loans and median Borrower Rates on the lower side of half. Auto, “Other” and Hosehold Expenses have the highest number of loans with Borrower Rates on the higher side of half.
summary(loan$ListingCategory)
## 4 - Personal Loan 9 - Boat 0 - Not Available
## 2395 85 16965
## 8 - Baby&Adoption 1 - Debt Consolidation 14 - Large Purchases
## 199 58308 876
## 3 - Business 5 - Student Use 2 - Home Improvement
## 7189 756 7433
## 16 - Motorcycle 12 - Green Loans 11 - Engagement Ring
## 304 59 217
## 6 - Auto 18 - Taxes 19 - Vacation
## 2572 885 768
## 20 - Wedding Loans 15 - Medical/Dental 17 - RV
## 771 1522 52
## 7- Other 13 - Household Expenses 10 - Cosmetic Procedure
## 10494 1996 91
summary(loan$BorrowerRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
summary(loan$CreditScoreRangeLower)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
cor.test(loan$BorrowerRate, loan$CreditScoreRangeLower)
##
## Pearson's product-moment correlation
##
## data: loan$BorrowerRate and loan$CreditScoreRangeLower
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4661358 -0.4569730
## sample estimates:
## cor
## -0.4615667
This plot reveals some very interesting insights. The “Not Available” listing category shows a very high number of loans and a high number of loans given to borrowers with low credit. The blue coloring corresponds to a low interest rate and the plot clearly shows the most blue coloring on the “Not Available” box plot. This is a strange finding as not giving a reason for a loan could be seen as higher risk, yet low credit scores are getting more reasonable rates when compared with other categories. Having more detailed knowledge about the industry and why this trend may exist would shed some light. Discussing the trend with someone in the industry would be helpful in this scenario. Removing the “Not Available” loans from the data set would likely yeild higher correlations between other variables as that listing category likely skews the dataset for those with low credit scores.
summary(loan$BankcardUtilization)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.310 0.600 0.561 0.840 5.950 7604
cor.test(loan$CreditScoreRangeLower, loan$BankcardUtilization)
##
## Pearson's product-moment correlation
##
## data: loan$CreditScoreRangeLower and loan$BankcardUtilization
## t = -144.58, df = 106330, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4103489 -0.4003027
## sample estimates:
## cor
## -0.405338
cor.test(loan$BorrowerRate, loan$BankcardUtilization)
##
## Pearson's product-moment correlation
##
## data: loan$BorrowerRate and loan$BankcardUtilization
## t = 86.168, df = 106330, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2498551 0.2610917
## sample estimates:
## cor
## 0.255482
The plot shows that those with lower Bankcard Utilization have higher credit scores and pay lower Borrower Rates. The blue is clearly concentrated in the bottom lower right corner of the plot. Those with high Bankcard Utilization tend to be on the left hand side of the plot with the lower credit scores and the higher borrower rates. On the far left side of the plot it shows that there are a few borrowers with very low Bankcard Utilization, low credit and higher borrower rates. This may show borrowers with very little credit or no credit and it is showing their utilization as 0.
summary(loan$CurrentDelinquencies)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.5921 0.0000 83.0000 697
cor.test(loan$CurrentDelinquencies, loan$CreditScoreRangeLower)
##
## Pearson's product-moment correlation
##
## data: loan$CurrentDelinquencies and loan$CreditScoreRangeLower
## t = -133.37, df = 113240, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3734729 -0.3634055
## sample estimates:
## cor
## -0.36845
cor.test(loan$CurrentDelinquencies, loan$AvailableBankcardCredit)
##
## Pearson's product-moment correlation
##
## data: loan$CurrentDelinquencies and loan$AvailableBankcardCredit
## t = -30.276, df = 106370, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09838769 -0.08647140
## sample estimates:
## cor
## -0.09243285
The mean trendline shows that credit score increases as available credit increases. Layering on current delinquencies shows that those with more than 1 delinquency tend to have very poor credit and tend not to have much credit (blue and purple dots are clustered in the lower left corner between 500-600 credit score and less that $2500 available credit.) This makes sense as individuals with poor credit are not likely to have new loans approved, and those with current delinquencies will have poor credit and will also be rejected for new loans.
The loan dataset has many variables and many of these variables are cross correlated. This makes it very hard to determine cause and effect. Throughout the investigation I tried to answer two questions: - What factors affect interest rate? - What factors affect credit score?
The assumption that I had was that a higher credit score will yield a lower interest rate. This correlation was true however credit score was only a one factor that predicted the interest rate. Several other factors played into the interest rate.
The issue with multivariate plots is that all three variables tend to be correlated which means that you may be showing redundent information. For example, an individual with high available credit is more likely to have a good credit score and also more likely to have a lower interest rate. If I were to repeat the analysis or add to it later, I would look for variables that appear to have no correlation with credit score but do seem to correlate with interest rate. This may yield a multivariate plot that has more explanatory power or a clearer trend because both the x axis variable and the color variable are adding different information to the plot (when they are correlated with each other they are really just displaying somewhat redundant information).